library(here)
library(scales)
library(forcats)
if (!require("ggVennDiagram")) install.packages("ggVennDiagram", repos = "https://cran.us.r-project.org")
library(ggVennDiagram)

resultsTidy <- readRDS(here("data/wrangled_data/resultsTidy.rds"))
source(here("resources/scripts/shared_functions.R"))

Main Analysis and Insights

Identify User Type

Takeaway: Of the 50 responses, 22 were from returning users and 28 from potential users. The majority of returning users belonged to the group who use the AnVIL for ongoing projects, while the majority of potential users were split roughly evenly between those who have heard of the AnVIL but never used it and those who used the AnVIL previously but no longer do.

Potential Follow-ups:

  • Look at whether the potential users who previously used the AnVIL show overall trends similar to the rest of the potential users
  • Directly ask why they no longer use the AnVIL (Elizabeth noted that the AnVIL is sometimes used in courses or workshops, and students may not use it afterward)

Prepare and plot the data

Description of variable definitions and steps

First, we group the data by the assigned UserType labels/categories and their related, more detailed descriptions. Then we use summarize to count the occurrences of each category. We use a mutate statement to make the detailed descriptions fit better on the plot. We then send this data to ggplot with the count on the x-axis and the usage descriptions on the y-axis (ordered by count so the highest count is on top), filling by the UserType description we've assigned. We manually scale the fill to AnVIL colors and specify a stacked bar chart. We then adjust the theme and labels, and finally add a geom_text label showing the count next to each bar before saving the plot.
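A minimal sketch of these steps, assuming a detailed-description column named UserTypeDescription and placeholder hex values for the AnVIL colors (both hypothetical):

```r
library(dplyr)
library(stringr)
library(forcats)
library(ggplot2)

# Count responses per user-type category and its detailed description
userTypeCounts <- resultsTidy %>%
  group_by(UserType, UserTypeDescription) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(UserTypeDescription = str_wrap(UserTypeDescription, width = 30))

ggplot(userTypeCounts,
       aes(x = n,
           y = fct_reorder(UserTypeDescription, n),   # highest count on top
           fill = UserType)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("#25445A", "#7EBAC0")) +  # placeholder colors
  geom_text(aes(label = n), hjust = -0.3) +              # count beside each bar
  labs(x = "Count", y = NULL, fill = "User type") +
  theme_minimal()
```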

Demographics: Highest Degree

Takeaway: Most of the respondents have a PhD or are currently working on a PhD, though a range of career stages are represented.

Prepare and plot the data

Description of variable definitions and steps

First we use group_by() on Degrees and UserType in conjunction with summarize(n = n()) to count how many of each combination are observed in the data.

Then we send this data to ggplot and make a bar chart with the x-axis representing the degrees, reordered by total count so that higher counts come first (otherwise the 2 MDs would be placed after the high school and master's-in-progress bars, which have 1 each). The y-axis represents the count, and the fill distinguishes user type (returning or potential AnVIL users). We use a stacked bar chart and include a label above each bar showing the total for that degree type.

We used this Stack Overflow post to label sums above the bars, and this Stack Overflow post to remove NA from the legend.

The rest of the changes are related to theme and labels and making sure that the numerical bar labels aren’t cut off on the top.
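The reorder-by-total, sum-label, and NA-legend tricks can be sketched together as follows (a hypothetical reconstruction, not the report's exact code):

```r
library(dplyr)
library(forcats)
library(ggplot2)

degreeCounts <- resultsTidy %>%
  group_by(Degrees, UserType) %>%
  summarize(n = n(), .groups = "drop")

ggplot(degreeCounts,
       aes(x = fct_reorder(Degrees, n, .fun = sum, .desc = TRUE),
           y = n, fill = UserType)) +
  geom_col(position = "stack") +
  # Label each bar with the sum across user types (the Stack Overflow trick)
  geom_text(aes(label = after_stat(y), group = Degrees),
            stat = "summary", fun = sum, vjust = -0.4) +
  scale_fill_discrete(na.translate = FALSE) +  # keep NA out of the legend
  coord_cartesian(clip = "off") +              # don't clip the top labels
  labs(x = "Highest degree", y = "Count", fill = "User type")
```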

Demographics: Kind of Work

Takeaway: Only a few responses report project management, leadership, or administration as their only kind of work. This increases our confidence that these roles won't confound later questions about usage of datasets or tools.

Potential Follow-ups:

  • Use this information (together with other info?) to try to cluster respondents/users into personas; see PersonaStats.Rmd

Prepare and plot the data

Description of variable definitions and steps

Demographics: Institutional Affiliation

Takeaway:

Prepare and plot the data

Description of variable definitions and steps

First, we set the factor levels for the further simplified institutional type column (FurtherSimplifiedInstitutionalType) so that we control the order on the y-axis when plotting. We then use group_by() together with summarize() to count the number of each further simplified institutional type for each UserType. We plot this as a bar plot with the institutional type on the y-axis and the count on the x-axis, and fill the stacked bars according to UserType. We add text labels to the bars displaying the sum for each institutional type. We also use custom annotation grobs that break down which institutional types are part of each further simplified institutional type (as defined in TidyData.Rmd). Note the liberal use of spaces to try to align these sub-labels. Finally, we pass the plot to the shared function stylize_bar() to change axis labels, fill colors, etc.
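A sketch of this workflow, under stated assumptions: the factor levels, the sub-label text, and the exact behavior of stylize_bar() are all placeholders based on the description above.

```r
library(dplyr)
library(ggplot2)
library(grid)

instCounts <- resultsTidy %>%
  mutate(FurtherSimplifiedInstitutionalType = factor(
    FurtherSimplifiedInstitutionalType,
    levels = c("Other", "Government", "Industry", "Academic"))) %>%  # assumed order
  group_by(FurtherSimplifiedInstitutionalType, UserType) %>%
  summarize(n = n(), .groups = "drop")

p <- ggplot(instCounts,
            aes(x = n, y = FurtherSimplifiedInstitutionalType, fill = UserType)) +
  geom_col() +
  # Sum label at the end of each stacked (horizontal) bar
  geom_text(aes(label = after_stat(x),
                group = FurtherSimplifiedInstitutionalType),
            stat = "summary", fun = sum, orientation = "y", hjust = -0.3) +
  coord_cartesian(clip = "off") +
  # One sub-label grob per category, spelling out its component types (text assumed)
  annotation_custom(
    grob = textGrob("  (e.g., universities, medical schools)",
                    gp = gpar(fontsize = 7), hjust = 0, x = 0),
    ymin = 3.7, ymax = 3.7, xmin = -Inf, xmax = Inf)

stylize_bar(p)  # shared helper: axis labels, AnVIL fill colors, theme
```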

Demographics: Consortia Affiliations

Takeaway: Of 50 responses, 21 provide an affiliation, with 25 unique affiliations represented across those responses (respondents could select more than one consortium). The following table shows the most represented consortia.

Prepare and display the data

consortium   count
CCDG             2
GTEx             2
GREGoR           3
PRIMED           3
eMERGE           3

Experience: Tool & Resource Knowledge/Comfort level

Takeaway: Except for Galaxy, potential users tend to report lower comfort levels for the various tools and technologies when compared to returning users. Where tools were present on and off AnVIL, returning users report similar comfort levels.

Overall, there is less comfort with containers or workflows than using various programming languages and integrated development environments (IDEs).

Prepare and plot the data

Description of variable definitions and steps for preparing the data

We bind the rows of two dataframes, one for returning users and one for potential users. The steps for building the dataframes are essentially the same once the first filter and mutate steps are completed. The first step for each dataframe is to filter on the UserType of interest. We then select the columns starting with Score_ or Score_AllTech that we created in TidyData.Rmd; for potential users, we only need the Score_AllTech columns, not the Score_ReturningAnVILTech columns. Because the scores are integers and we want to sum the scores across responses, we use a column-sum function and send those sums to a data frame where each row name is the previous column name and the summed scores are stored in the totalScore column. We add columns nscores, avgScore, and UserType that store the number of responses/scores, the average score (total divided by number of scores), and the applicable type of user. Row names are then moved to a column called WhereTool, and this column is separated into two columns, splitting on the word “Tech”, such that the new AnVILorNo column contains either Score_All or Score_ReturningAnVIL; we translate those to “Separate from the AnVIL” and “On the AnVIL” respectively. The new Tool column contains the shorthand tool names, which we recode to add spaces or more information.
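The shared steps could be wrapped in a helper like the one below; the function name and exact column prefixes are assumptions inferred from the text.

```r
library(dplyr)
library(tidyr)
library(tibble)

prep_scores <- function(df, user_type, prefix) {
  scores <- df %>%
    filter(UserType == user_type) %>%
    select(starts_with(prefix))

  # Sum each score column across responses, then reshape for plotting
  colSums(scores, na.rm = TRUE) %>%
    as.data.frame() %>%
    setNames("totalScore") %>%
    rownames_to_column("WhereTool") %>%
    mutate(nscores  = nrow(scores),
           avgScore = totalScore / nscores,
           UserType = user_type) %>%
    # e.g. "Score_AllTechR" -> AnVILorNo = "Score_All", Tool = "R"
    separate(WhereTool, into = c("AnVILorNo", "Tool"), sep = "Tech") %>%
    mutate(AnVILorNo = recode(AnVILorNo,
                              Score_All            = "Separate from the AnVIL",
                              Score_ReturningAnVIL = "On the AnVIL"))
}

toolComfort <- bind_rows(
  prep_scores(resultsTidy, "Returning", "Score_"),
  prep_scores(resultsTidy, "Potential", "Score_AllTech")
)
```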

Description of variable definitions and steps for plotting the dumbbell-like plot

We used this Stack Overflow response to get the values for scale_shape_manual().

Experience: Types of Data Analyzed

Question and possible answers

What types of data do you or would you analyze using the AnVIL?

Possible answers include

  • Genomes/exomes
  • Transcriptomes
  • Metagenomes
  • Proteomes
  • Metabolomes
  • Epigenomes
  • Structural
  • Single Cell
  • Imaging
  • Phenotypic
  • Electronic Health Record
  • Metadata
  • Survey
  • Other (with free text response)

Takeaway:

Prepare and plot the data

Description of variable definitions and steps for preparing the data
Description of variable definitions and steps for plotting the bar graphs

Experience: Genomics and Clinical Research Experience

Takeaway: 21 respondents report that they are extremely experienced in analyzing human genomic data, while only 6 respondents report that they are not at all experienced in analyzing human genomic data. However, for human clinical data and non-human genomic data, more respondents report being not at all experienced in analyzing those data than report being extremely experienced.

Potential Follow-ups

  • What’s the overlap like for those moderately or extremely experienced in these various categories? (Note: Found in the supplemental analyses)
Question and possible answers

How much experience do you have analyzing the following data categories?

The data categories were

  • Human genomic
  • Non-human genomic
  • Human clinical

and for each category, possible options were

  • Not at all experienced
  • Slightly experienced
  • Somewhat experienced
  • Moderately experienced
  • Extremely experienced

Prepare and plot the data

Description of variable definitions and steps for preparing the data

Here we select the columns containing answers for each data category: HumanGenomicExperience, HumanClinicalExperience, and NonHumanGenomicExperience. We also select UserType in case we want to split out user type when viewing the data. We use pivot_longer to make a long dataframe that can be grouped and counted: the category/column names go to a new column, researchType, and the values go to a new column, experienceLevel. Before we group and count, we set the factor levels on the new experienceLevel column to match the progression from not at all experienced to extremely experienced, and we rename the research categories so that the words have spaces and say “research” instead of “experience”. Then we use group_by and summarize to count each combination of research category, experience level, and UserType; these counts are in the new n column.
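These preparation steps can be sketched as follows (the recoded category labels are assumptions based on the description):

```r
library(dplyr)
library(tidyr)

experienceCounts <- resultsTidy %>%
  select(UserType, HumanGenomicExperience,
         HumanClinicalExperience, NonHumanGenomicExperience) %>%
  pivot_longer(ends_with("Experience"),
               names_to  = "researchType",
               values_to = "experienceLevel") %>%
  mutate(
    # Order levels from least to most experienced
    experienceLevel = factor(experienceLevel,
                             levels = c("Not at all experienced",
                                        "Slightly experienced",
                                        "Somewhat experienced",
                                        "Moderately experienced",
                                        "Extremely experienced")),
    researchType = recode(researchType,
                          HumanGenomicExperience    = "Human genomic research",
                          HumanClinicalExperience   = "Human clinical research",
                          NonHumanGenomicExperience = "Non-human genomic research")
  ) %>%
  group_by(researchType, experienceLevel, UserType) %>%
  summarize(n = n(), .groups = "drop")
```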

Description of variable definitions and steps for plotting the bar graph

We didn’t observe big differences between returning and potential users, so we believe this grouped plot is useful for understanding the community as a whole.

This bar plot has the experience level on the x-axis, the count on the y-axis, and fills the bars according to the experience level (though the fill/color legend is turned off by setting legend.position to none). We facet by the research category and label the bars. We use stat = "summary" with a sum function and after_stat(y) for the labels, since the data includes splits (like UserType) that we're not visualizing here.

We adjust various aspects of the theme like turning off the grid and background and rotating the x-tick labels and changing the x- and y-axis labels. We also slightly widen the left axis so that the tick labels aren’t cut off.
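A sketch of this plot, assuming the experienceCounts dataframe built in the preparation step above:

```r
library(ggplot2)

ggplot(experienceCounts,
       aes(x = experienceLevel, y = n, fill = experienceLevel)) +
  geom_col() +
  facet_wrap(~ researchType) +
  # summary/sum + after_stat(y) so each label totals across the UserType split
  geom_text(aes(label = after_stat(y), group = experienceLevel),
            stat = "summary", fun = sum, vjust = -0.5) +
  coord_cartesian(clip = "off") +
  theme(legend.position  = "none",
        panel.grid       = element_blank(),
        panel.background = element_blank(),
        axis.text.x      = element_text(angle = 45, hjust = 1),
        plot.margin      = margin(5.5, 5.5, 5.5, 12)) +  # widen the left side
  labs(x = "Experience level", y = "Count")
```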

Experience: Controlled Access Datasets

Takeaway: Generally, over half of respondents report they are extremely interested in working with controlled access datasets.

For specific controlled access datasets …

  • Of the survey provided choices, respondents have accessed or are particularly interested in accessing All of Us, UK Biobank, and GTEx (though All of Us and UK Biobank are not currently AnVIL hosted).
  • 2 respondents (moderately or extremely experienced with genomic data) specifically wrote in “TCGA”.
  • The trend of All of Us, UK Biobank, and GTEx being chosen the most is consistent across all 3 research categories (moderately or extremely experienced with clinical, human genomic, or non-human genomic data).
Question and possible answers

What large, controlled access datasets do you access or would you be interested in accessing using the AnVIL?

  • All of Us*
  • Centers for Common Disease Genomics (CCDG)
  • The Centers for Mendelian Genomics (CMG)
  • Clinical Sequencing Evidence-Generating Research (CSER)
  • Electronic Medical Records and Genomics (eMERGE)
  • Gabriella Miller Kids First (GMKF)
  • Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR)
  • The Genotype-Tissue Expression Project (GTEx)
  • The Human Pangenome Reference Consortium (HPRC)
  • Population Architecture Using Genomics and Epidemiology (PAGE)
  • Undiagnosed Disease Network (UDN)
  • UK Biobank*
  • None
  • Other (Free Text Response)

Since this is a select-all-that-apply question, we expect multiple responses that are comma separated. The free text responses will likely need to be recoded as well. The responses are in the AccessWhichControlledData column.

Description of variable definitions and steps for preparing bar plot

Description of variable definitions and steps for preparing the data

We use a function prep_df_whichData(), defined in the shared_functions.R script, since we'll run this workflow several times for different subsets of the data: we want to display the data based on the experience status (experienced with clinical research, human genomics research, etc.) of the respondents requesting access.

We want to color the bars based on whether the controlled access dataset is currently available on the AnVIL. We create a dataframe onAnVILDF to record this, using the AnVIL dataset catalog/browser to find the information. HPRC and GREGoR don't appear in that resource but are both available per these sources: Announcement for HPRC, Access for HPRC, Access for GREGoR. GMKF and TCGA are hosted on other NCPI platforms and are accessible via the AnVIL because of interoperability (see https://www.ncpi-acc.org/ and https://ncpi-data.org/platforms); we list them as non-AnVIL hosted since, while accessible, they are not hosted on the AnVIL itself and are inaccessible without NCPI. Finally, UDN is listed as non-AnVIL hosted because it is in the data submission pipeline and not yet available.

We’ll join this anvil-hosted or not data with the actual data at the end.

Given the input subset_df, we expect several answers to be comma separated. Since there are 12 fixed possible responses (not including “None”) and one possible free response answer, we separate the AccessWhichControlledData column into 13 columns (“WhichA” through “WhichN”), splitting on “, ” (a comma followed by a space; splitting on a comma alone produced duplicates that differed only by a leading space). Alternative approaches could use str_trim(). We set fill to “right”, but this shouldn't really matter; it just suppresses the unnecessary warning about adding NAs when there aren't 13 responses. If there's only one response, it goes in WhichA and the remaining columns are filled with NA; if there are two responses, they go in WhichA and WhichB and the rest are NA, and so on.
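The separate/pivot/count/join pipeline might look like this sketch; the join key name ("dataset") in onAnVILDF is an assumption.

```r
library(dplyr)
library(tidyr)

whichCols <- paste0("Which", LETTERS[1:13])   # "WhichA", "WhichB", ...

datasetCounts <- subset_df %>%
  separate(AccessWhichControlledData,
           into = whichCols,
           sep  = ", ",          # comma followed by a space
           fill = "right") %>%   # pad short answers with NA, no warning
  pivot_longer(all_of(whichCols),
               names_to  = "WhichChoice",
               values_to = "whichControlledAccess") %>%
  drop_na(whichControlledAccess) %>%
  group_by(whichControlledAccess) %>%
  summarize(n = n(), .groups = "drop") %>%
  # Annotate only datasets that appear in the results (hence left, not full, join)
  left_join(onAnVILDF, by = c("whichControlledAccess" = "dataset"))
```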

We then use pivot_longer on the columns we just made, putting the column names in a new column WhichChoice and the values in a new column whichControlledAccess. We drop all the NAs in this new whichControlledAccess column (and there are a lot of them).

Then we group by the new whichControlledAccess column and summarize a count for how many there are for each response.

Then we pass this to a mutate and recode function to simplify the fixed responses to just their acronyms, to remove asterisks (which told the survey respondent that the dataset wasn't available because of policy restrictions), and to recode the free text responses (details below in “Notes on free text response recoding”).

We use a left_join() to join the cleaned data with a dataframe that specifies whether each dataset is currently available on the AnVIL. It's a left join rather than a full join so that it only adds the annotation for datasets that appear in the results.

Finally, we return this subset and cleaned dataframe so that it can be plotted.

Additional notes on free text response recoding

There were 4 “Other” free text responses

  • “Being able to pull other dbGap data as needed.” –> We recoded this to be an “Other”
  • “GnomAD and ClinVar” –> GnomAD and ClinVar are not controlled access datasets so we recoded that response to be “None”
  • “Cancer omics datasets” –> We recoded this to be an “Other”
  • “TCGA” –> This response was left as is since there is a controlled access tier.
Description of variable definitions and steps for preparing the data continued

Here we set up 4 data frames for plotting

  • The first uses all of the responses and sends them through the prep_df_whichData() function to clean the data for plotting to see which controlled access datasets are the most popular.
  • The second filters to grab just the responses from those experienced in clinical research using the clinicalFlag column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
  • The third filters to grab just the responses from those experienced in human genomic research using the humanGenomicFlag column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
  • The fourth filters to grab just the responses from those experienced in non-human genomic research using the nonHumanGenomicFlag column (described earlier in the Clean Data -> Simplified experience status for various research categories (clinical, human genomics, non-human genomics) subsection)
Description of variable definitions and steps for plotting the bar graphs

Also have a function from shared_functions.R for this because it’s the same plotting steps for each just changing the subtitle and which dataframe is used as input.

This takes the input dataframe and plots a bar plot with the x-axis having the controlled access datasets listed (reordering the listing based off of the count so most popular is on the left), the count number/popularity of requested is on the y-axis, and the fill is based on whether the dataset is available on AnVIL or not.

We change the theme elements like removing panel borders, panel background, and panel grid, and rotate the x-axis tick labels. We add an x- and y- axis label and add a title (and subtitle if specified - which it will be when we’re looking at just a subset like those who are experienced with clinical data)

We also add text labels above the bars saying how many times each dataset was marked/requested. Note that we have to use the after_stat, summary, and sum approach again because of the recoding: for the labels to be accurate, they must capture every response that was recoded to the same value after the initial group_by/summarize count. The function uses coord_cartesian(clip = "off") so these bar text labels aren't cut off, and finally returns the plot.

We call this function 4 times

  • once for all the data (and don’t use a subtitle)
  • next for just those experienced with clinical data (using a subtitle to specify this)
  • next for just those experienced with human genomic data (using a subtitle to specify this)
  • and finally for just those experienced with non-human genomic data (using a subtitle to specify this)

Awareness: Monthly AnVIL Demos

Takeaway: Most respondents have not attended an AnVIL Demo. To investigate whether this is an awareness issue, we aggregated all responses except “No, didn't know of”. We see that the majority of respondents are aware of AnVIL Demos; their responses are just distributed among different ways of utilizing the demos. Further, there's awareness among both returning and potential AnVIL users.

Prepare and plot the data

Raw responses

Responses recoded to focus on awareness

Awareness: AnVIL Support Forum

Takeaway: Most respondents have not used the AnVIL support forum.

  • We aggregated these responses to examine awareness. We observe that there is awareness of the support forum across potential and returning users.
  • While utilization in some form is reported by about 20% of respondents, reading through others’ posts is the most common way of utilizing the support forum within this sample.

Prepare and plot the data

Raw responses

Responses recoded to focus on awareness

Preferences: Feature Importance Ranking

Takeaway: All respondents rate having specific tools or datasets supported/available as a very important feature for using AnVIL. Compared to returning users, potential users rate having a free-version with limited compute or storage as the most important feature for their potential use of the AnVIL.

Question and possible answers

Rank the following features or resources according to their importance for your continued use of the AnVIL

Rank the following features or resources according to their importance to you as a potential user of the AnVIL

  • Easy billing setup
  • Flat-rate billing rather than use-based
  • Free version with limited compute or storage
  • On demand support and documentation
  • Specific tools or datasets are available/supported
  • Greater adoption of the AnVIL by the scientific community

We’re going to look at a comparison of the assigned ranks for these features, comparing between returning users and potential users.

Prepare and plot the data

Average rank is total rank (sum of given ranks) divided by number of votes (number of given ranks)

Description of variable definitions and steps for preparing the data

We make two dataframes that compute the total ranks (column name: totalRank) and average ranks (column name: avgRank) for each feature, and then row bind (bind_rows) these two dataframes together to make totalRanksdf. We build them separately because one is for potential users (starts_with("PotentialRank")) and one is for returning users (starts_with("ReturningRank")); they have a different number of votes (nranks), so it made more sense to follow the same steps on each and then bind the results together.

The individual steps for each of these dataframes is to

  • select the relevant columns from resultsTidy
  • perform sums with colSums, adding together the ranks in those columns (each column corresponds to a queried feature); We set na.rm = TRUE to ignore the NAs (since not every survey respondent was asked each question; e.g., if they were a returning user they weren’t asked as a potential user)
  • send those sums to a data frame such that the selected column names from the first step are now the row names and the total summed rank is the only column with values in each row corresponding to each queried feature
  • Use a mutate to
    • add a new column nranks that holds the number of survey responses from potential users (i.e., those who would have assigned ranks to the PotentialRank questions) or the number from returning users (i.e., those who would have assigned ranks to the ReturningRank questions).
    • add a new column avgRank that divides the totalRank by the nranks

After these two dataframes are bound together (bind_rows), the rest of the steps are for aesthetics in plotting and making sure ggplot knows the UserType and the feature of interest, etc.

  • We move the rownames to their own column UsertypeFeature (with the mutate(UsertypeFeature = rownames(.))).
  • We separate the values in that column on the word “Rank”, which removes the UsertypeFeature column we just made and creates two new columns (Usertype and Feature), where Usertype is either “Returning” or “Potential” and Feature holds the shorthand feature names.
  • We then use a case_when() within a mutate() to expand those shorthand features so they're more informative and show the choices survey respondents were given.
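The preparation steps above can be sketched as a helper; the shorthand feature names inside case_when() are illustrative assumptions, not the survey's actual column suffixes.

```r
library(dplyr)
library(tidyr)
library(tibble)

prep_ranks <- function(df, prefix, user_type) {
  rankCols <- df %>% select(starts_with(prefix))

  # na.rm = TRUE: respondents of the other user type were never asked
  colSums(rankCols, na.rm = TRUE) %>%
    as.data.frame() %>%
    setNames("totalRank") %>%
    rownames_to_column("UsertypeFeature") %>%
    mutate(nranks  = sum(df$UserType == user_type),
           avgRank = totalRank / nranks) %>%
    # e.g. "PotentialRankFree" -> Usertype = "Potential", Feature = "Free"
    separate(UsertypeFeature, into = c("Usertype", "Feature"), sep = "Rank") %>%
    mutate(Feature = case_when(
      Feature == "Free"    ~ "Free version with limited compute or storage",
      Feature == "Billing" ~ "Easy billing setup",
      TRUE                 ~ Feature   # remaining shorthand names (assumed)
    ))
}

totalRanksdf <- bind_rows(
  prep_ranks(resultsTidy, "PotentialRank", "Potential"),
  prep_ranks(resultsTidy, "ReturningRank", "Returning")
)
```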
Description of variable definitions and steps for plotting the dumbbell plot

We use the totalRanksdf we just made. The x-axis shows the avgRank values and the y-axis displays the informative Feature values; we reorder the y-axis so that more important (lower avgRank) features appear higher in the plot.

geom_point and geom_line are used together to produce the dumbbell look of the plot, and we color the points according to Usertype.

We adjust some theme settings, add labels and titles, set the colors to match AnVIL colors, and then display and save the plot.
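A minimal dumbbell sketch using the totalRanksdf described above; the hex colors are placeholders for the AnVIL palette.

```r
library(forcats)
library(ggplot2)

p <- ggplot(totalRanksdf,
            aes(x = avgRank,
                y = fct_reorder(Feature, avgRank, .desc = TRUE))) +
  geom_line(aes(group = Feature)) +              # the bar of the dumbbell
  geom_point(aes(color = Usertype), size = 3) +  # one point per user type
  scale_color_manual(values = c("#25445A", "#7EBAC0")) +
  labs(x = "Average rank (1 = most important)", y = NULL, color = "User type")

p
```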

The first version of the plot has trimmed limits, so the second version sets limits on the x-axis of 1 to 6 since those were the options survey respondents were given for ranking. It also adds annotations (using Grobs, explained in this Stack Overflow post answer) to specify which rank was “Most important” and which was “Least important”.

We also adjust the left margin so that the annotation isn't cut off.

We then display and save that version as well.

Finally, we reverse the x-axis so that most important is on the right and least important is on the left, using scale_x_reverse(). We have to change our grob annotations so that they use the negated versions of the xmin and xmax values we were using previously. We then display and save that version as well.
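The axis reversal and negated annotation coordinates can be sketched as below, assuming p is the dumbbell plot built earlier:

```r
library(grid)
library(ggplot2)

p +
  scale_x_reverse(limits = c(6, 1)) +  # rank 1 (most important) now on the right
  coord_cartesian(clip = "off") +
  # Under scale_x_reverse(), annotation_custom positions use negated data values:
  # to place a label near rank 1, use xmin = xmax = -1
  annotation_custom(
    grob = textGrob("Most important", gp = gpar(fontsize = 8)),
    xmin = -1, xmax = -1, ymin = -0.6, ymax = -0.6)
```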

Preferences: Training Workshop Modality Ranking

Takeaway: Both returning and potential users vastly prefer virtual training workshops.

Question and possible answers

Please rank how/where you would prefer to attend AnVIL training workshops.

Possible answers include

  • On-site at my institution: AnVILTrainingWorkshopsOnSite
  • Virtual: AnVILTrainingWorkshopsVirtual
  • Conference (e.g., CSHL, AMIA): AnVILTrainingWorkshopsConference
  • AnVIL-specific event: AnVILTrainingWorkshopsSpecEvent
  • Other: AnVILTrainingWorkshopsOther

The responses are stored in the columns starting with AnVILTrainingWorkshops.

Prepare and plot the data

Description of variable definitions and steps for preparing the data
Description of variable definitions and steps for plotting the dumbbell plot

Preferences: Where analyses are currently run

Takeaway: Institutional HPC and local/personal computers are the most common responses.

  • Google Cloud Platform (GCP) is reported as used more than other cloud providers within this sample.
  • We also see that potential users report using Galaxy (a free option) more than returning users do.

Prepare and plot the data

Preferences: DMS compliance/data repositories

Preferences: Source for cloud computing funds

Takeaway: NIH funds (NHGRI or otherwise) as well as institutional funds are the most commonly reported funding sources.

Question and possible answers

What source(s) of funds do you use to pay for cloud computing?

Possible answers include

  • NHGRI
  • Other NIH
  • Foundation Grant
  • Institutional funds
  • Don’t know
  • Only use free options
  • Other (with free text entry if Other is selected)

The only “Other” response in this set is NSF.

Answers are stored in the FundingSources column. This question was select-all-that-apply, so answers will be comma separated; it was asked of all survey takers.

Prepare and plot the data

Prepare the data variable definition and steps
Plot the data variable definition and steps

Returning User: Length of Use of the AnVIL

Takeaway: Respondents have a range of experience on AnVIL.

Returning User: Foreseeable Computational Needs

Takeaway: Of the 22 returning users, all 22 provided an answer to this question. The most common response here is needing large amounts of storage.

Returning User: Recommendation likelihood

Takeaway: There’s a fairly bimodal distribution here with users either extremely likely or only moderately likely to recommend the AnVIL.

Session Info

Session Info
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Indiana/Indianapolis
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] lubridate_1.9.4     stringr_1.5.2       dplyr_1.1.4        
##  [4] purrr_1.1.0         readr_2.1.5         tidyr_1.3.1        
##  [7] tibble_3.3.0        ggplot2_3.5.2       tidyverse_2.0.0    
## [10] magrittr_2.0.4      ggVennDiagram_1.5.2 forcats_1.0.0      
## [13] scales_1.4.0        here_1.0.1         
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.4.0         stringi_1.8.7     
##  [5] hms_1.1.3          digest_0.6.37      evaluate_1.0.3     timechange_0.3.0  
##  [9] RColorBrewer_1.1-3 fastmap_1.2.0      rprojroot_2.1.1    jsonlite_2.0.0    
## [13] viridisLite_0.4.2  textshaping_1.0.1  jquerylib_0.1.4    cli_3.6.5         
## [17] crayon_1.5.3       rlang_1.1.6        bit64_4.6.0-1      withr_3.0.2       
## [21] cachem_1.1.0       yaml_2.3.10        parallel_4.5.0     tools_4.5.0       
## [25] tzdb_0.5.0         kableExtra_1.4.0   vctrs_0.6.5        R6_2.6.1          
## [29] lifecycle_1.0.4    bit_4.6.0          vroom_1.6.5        pkgconfig_2.0.3   
## [33] pillar_1.11.1      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [37] systemfonts_1.2.3  xfun_0.52          tidyselect_1.2.1   rstudioapi_0.17.1 
## [41] knitr_1.50         farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29    
## [45] svglite_2.2.1      labeling_0.4.3     compiler_4.5.0